Introduction

The purpose of this page is to present the approach used to analyse the truck time-series data collected by the SIWIM system, and then to build the best possible prediction model.

We continue our analysis with the dataset in which the missing values have been imputed:

Read 183928 rows and 39 (of 39) columns from 0.056 GB file in 00:00:03
variable classe first_values
V1 character 1, 2, 3, 4, 5, 6
Timestamp character 2017-11-23 00:00:42, 2017-11-23 00:21:31, 2017-11-23 00:21:32, 2017-11-23 00:24:14, 2017-11-23 00:24:53, 2017-11-23 00:30:35
Site_ID character Normandie, Normandie, Normandie, Normandie, Normandie, Normandie
Warning_flags character 00000000, 00000000, 00000000, 00000000, 00000000, 00000000
N integer 5, 2, 2, 5, 5, 5
Subclass_ID character 113, 40, 20, 113, 113, 74
Axle_groups character 113, 11, 11, 113, 113, 113
A1 double 3.89637, 3.53769, 2.37186, 3.80111, 3.7235, 3.84
A2 double 5.67876, NA, NA, 5.92265, 5.67742, 6
A3 double 1.32642, NA, NA, 1.28177, 1.32719, 1.4
A4 double 1.32642, NA, NA, 1.37017, 1.32719, 1.32
A5 double NA, NA, NA, NA, NA, NA
A6 double NA, NA, NA, NA, NA, NA
A7 double NA, NA, NA, NA, NA, NA
Date character 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23
Horaire character 00:00:42, 00:21:31, 00:21:32, 00:24:14, 00:24:53, 00:30:35
Annee character 2017, 2017, 2017, 2017, 2017, 2017
Mois_num character 11, 11, 11, 11, 11, 11
Mois_annee character novembre, novembre, novembre, novembre, novembre, novembre
Jour_num character 22, 22, 22, 22, 22, 22
Jour_semaine character mercredi, mercredi, mercredi, mercredi, mercredi, mercredi
Heure character 00, 00, 00, 00, 00, 00
M1 double 4.68117227319062, 1.95471967380224, 0.887510703363914, 5.09427115188583, 5.06275229357798, 4.55823649337411
M2 double 8.72759429153925, 2.87089704383282, 0.770931702344546, 8.10930682976555, 9.0244750254842, 8.61866462793068
M3 double 4.24332313965342, NA, NA, 1.74763506625892, 4.65559633027523, 2.85606523955148
M4 double 5.91610601427115, NA, NA, 3.56166156982671, 6.88427115188583, 4.34165137614679
M5 double 5.82585117227319, NA, NA, 3.37384301732926, 6.48813455657492, 3.96273190621814
M6 double NA, NA, NA, NA, NA, NA
M7 double NA, NA, NA, NA, NA, NA
M8 double NA, NA, NA, NA, NA, NA
Lane character droite, droite, droite, droite, droite, droite
Nb_axles character 5, 2, 2, 5, 5, 5
Anomalie character N, N, N, N, N, N
total_axle_dist double 12.228, 3.53769, 2.37186, 12.3757, 12.0553, 12.56
T double 12.9083, 12.8783, 12.8783, 12.91, 12.9167, 12.835
Reduced_chi_squared double 391.589, 4.25249, 13.8461, 401.025, 521.619, 184.582
Time_num double 0.905556905864198, 0.905597061471193, 0.905597093621399, 0.905602301954733, 0.905603555812757, 0.905614551183128
Vitesse double 76.40208, 74.09844, 74.09844, 81.46728, 67.95216, 73.728
MGV double 29.3940876656473, 4.82561671763507, 1.65844036697248, 21.886748216106, 32.1151885830785, 24.3373088685015

We apply a few conversions after importing the data:

# Format date & time
siwim_data[, Timestamp := as.POSIXct(substr(Timestamp, 1, 19), format = '%Y-%m-%d %H:%M:%S')]

siwim_data[, Date := as.Date(Date)]

# Refactor time data

cols <- c(
  "Mois_annee",
  "Jour_semaine",
  "Heure",
  "Axle_groups"
)

siwim_data[, (cols) := lapply(.SD, function(x) as.factor(x)), .SDcols=cols]

# Hour as numeric (note: converting a factor yields its level codes, so "00" becomes 1)
siwim_data[, Heure_num := as.numeric(Heure)]

# Drop columns that are no longer needed
maxN <- 8
axles_load <- paste("A", 1:(maxN-1), sep = "")
axles_mass <- paste("M", 1:maxN, sep = "")

useless_cols <- c(axles_load,axles_mass, "Site_ID")

siwim_data[, (useless_cols) := NULL]

# Structure after conversions
#str(siwim_data)
data.frame(variable = names(siwim_data),
           classe = sapply(siwim_data, typeof),
           first_values = sapply(siwim_data, function(x) paste0(head(x),  collapse = ", ")),
           row.names = NULL) %>% 
  kable("html") %>%
  kable_styling()
variable classe first_values
V1 character 1, 2, 3, 4, 5, 6
Timestamp double 2017-11-23 00:00:42, 2017-11-23 00:21:31, 2017-11-23 00:21:32, 2017-11-23 00:24:14, 2017-11-23 00:24:53, 2017-11-23 00:30:35
Warning_flags character 00000000, 00000000, 00000000, 00000000, 00000000, 00000000
N integer 5, 2, 2, 5, 5, 5
Subclass_ID character 113, 40, 20, 113, 113, 74
Axle_groups integer 113, 11, 11, 113, 113, 113
Date double 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23
Horaire character 00:00:42, 00:21:31, 00:21:32, 00:24:14, 00:24:53, 00:30:35
Annee character 2017, 2017, 2017, 2017, 2017, 2017
Mois_num character 11, 11, 11, 11, 11, 11
Mois_annee integer novembre, novembre, novembre, novembre, novembre, novembre
Jour_num character 22, 22, 22, 22, 22, 22
Jour_semaine integer mercredi, mercredi, mercredi, mercredi, mercredi, mercredi
Heure integer 00, 00, 00, 00, 00, 00
Lane character droite, droite, droite, droite, droite, droite
Nb_axles character 5, 2, 2, 5, 5, 5
Anomalie character N, N, N, N, N, N
total_axle_dist double 12.228, 3.53769, 2.37186, 12.3757, 12.0553, 12.56
T double 12.9083, 12.8783, 12.8783, 12.91, 12.9167, 12.835
Reduced_chi_squared double 391.589, 4.25249, 13.8461, 401.025, 521.619, 184.582
Time_num double 0.905556905864198, 0.905597061471193, 0.905597093621399, 0.905602301954733, 0.905603555812757, 0.905614551183128
Vitesse double 76.40208, 74.09844, 74.09844, 81.46728, 67.95216, 73.728
MGV double 29.3940876656473, 4.82561671763507, 1.65844036697248, 21.886748216106, 32.1151885830785, 24.3373088685015
Heure_num double 1, 1, 1, 1, 1, 1

Aggregating the data at the chosen time step

We decided to forecast at an hourly horizon, so we aggregate the data by date and hour. In this operation we count the number of trucks, sum the truck weights and the distances between axles, and take the mean of the temperature and of the truck speeds.

## Aggregating data by day and hour (the chosen time window)

siwim_data_hours <- siwim_data[, .(Count=.N, Total_Weight=sum(MGV),
                                   Total_axle_dist=sum(total_axle_dist), 
                                   T_mean=mean(T),
                                   Vitesse_mean=mean(Vitesse)), by = .(Date, Heure)]


#str(siwim_data_hours)
data.frame(variable = names(siwim_data_hours),
           classe = sapply(siwim_data_hours, typeof),
           first_values = sapply(siwim_data_hours, function(x) paste0(head(x),  collapse = ", ")),
           row.names = NULL) %>% 
  kable("html") %>%
  kable_styling()
variable classe first_values
Date double 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23, 2017-11-23
Heure integer 00, 01, 02, 03, 04, 05
Count integer 14, 11, 9, 24, 29, 45
Total_Weight double 275.833261977574, 292.930397553517, 283.591437308868, 501.946666666667, 604.64493374108, 916.674852191641
Total_axle_dist double 140.26137, 118.71912, 106.3454, 262.82355, 298.64435, 503.139055830308
T_mean double 12.8463, 12.7856, 12.7861222222222, 12.8314666666667, 13.0237965517241, 13.1572577777778
Vitesse_mean double 75.21444, 72.8242690909091, 67.67436, 79.000275, 77.4667117241379, 71.239432

We then create a number of categorical and numerical time-based variables (year, month, week, day, day of the week).

siwim_data_hours[, ':=' (Annee =  as.numeric(format(as.Date(Date), format = "%Y")),
                         Mois_annee = factor(format(as.Date(Date), format = "%B")),
                         Mois_num = as.numeric(format(as.Date(Date), format = "%m")),
                         Jour_num = as.numeric(format(as.Date(Date), format = "%d")),
                         Jour_sem_num = as.numeric(format(as.Date(Date), format = "%w")),
                         Semaine_num = as.numeric(format(as.Date(Date), format = "%V")),
                         Semaine_fac = factor(format(as.Date(Date), format = "%V")),
                         Heure_num = as.numeric(Heure)
                         )]

# Categorical day-of-week variable
siwim_data_hours[, Jour_semaine := factor(weekdays(Date), 
                                          levels = c("lundi", "mardi", "mercredi",
                                                     "jeudi", "vendredi", "samedi",
                                                     "dimanche"))]

siwim_data_hours$Jour_semaine <- relevel(siwim_data_hours$Jour_semaine, ref = "lundi")

#Time step creation
siwim_data_hours[, time_step := as.POSIXct(paste(as.character(Date), paste(Heure, '00','00', sep = ':'), sep = " "))]

Here is a graphical representation of the different variables:

A few views of the distribution by month, day of the week, and hour.

# Box plot by month
#boxplot(Count ~ Mois_annee, data = siwim_data_hours)
ggplot(siwim_data_hours, aes(x=Mois_annee, y=Count, fill=Mois_annee)) +
  geom_boxplot() + scale_x_discrete(limits=c("juillet", "août", "septembre", "octobre", "novembre"))

# Box plot by day of the week
#boxplot(Count ~ Jour_semaine, data = siwim_data_hours)
ggplot(siwim_data_hours, aes(x=Jour_semaine, y=Count, fill=Jour_semaine)) +
  geom_boxplot() + scale_x_discrete(limits=c("lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche"))

# Box plot by hour
#boxplot(Count ~ Heure, data = siwim_data_hours)
ggplot(siwim_data_hours, aes(x=Heure, y=Count, fill=Heure)) +
  geom_boxplot()

Handling missing values (dates and hours)

Several periods show gaps in the data, most likely caused by an outage of the SIWIM system. The idea is to reconstruct the count values by linear or spline interpolation. To do this, we first complete the time series with the missing dates and hours.

## Get the full hourly sequence
full_sequence <- seq(from=min(siwim_data_hours$time_step), 
                     to=max(siwim_data_hours$time_step), by="h")

## Identify the missing timestamps
missing_sequence <- full_sequence[!(full_sequence %in% siwim_data_hours$time_step)]

# missing_sequence[sample(.N,3)]
#sapply(missing_sequence[], sample, 10)
# Draw 30 of the missing timestamps at random for display
# (sampled with replacement, so duplicates can occur)
n_samples <- sample(1:length(missing_sequence), 30, replace = TRUE)
kable(missing_sequence[n_samples])
## Warning in kable_markdown(x = structure(c("2017-09-19 18:00:00",
## "2017-08-24 05:00:00", : The table should have a header (column names)
2017-09-19 18:00:00
2017-08-24 05:00:00
2017-09-15 04:00:00
2017-09-20 04:00:00
2017-10-06 16:00:00
2017-09-18 12:00:00
2017-09-17 08:00:00
2017-10-10 03:00:00
2017-09-15 09:00:00
2017-08-23 16:00:00
2017-07-19 21:00:00
2017-09-18 01:00:00
2017-09-12 15:00:00
2017-10-05 13:00:00
2017-10-13 20:00:00
2017-07-08 22:00:00
2017-08-26 00:00:00
2017-09-19 20:00:00
2017-10-10 09:00:00
2017-08-28 00:00:00
2017-08-26 13:00:00
2017-09-12 14:00:00
2017-09-19 18:00:00
2017-08-27 17:00:00
2017-10-06 11:00:00
2017-10-08 20:00:00
2017-08-27 06:00:00
2017-09-15 16:00:00
2017-09-20 05:00:00
2017-10-15 12:00:00

We build a new dataset containing only the missing dates and hours; the time-related categorical and numerical variables are derived at the same time. Finally, we merge the existing-data and missing-data datasets.

siwim_data_hours_missing <- data.table(time_step = missing_sequence, 
                                       Date =  as.Date(strftime(missing_sequence, format = "%Y-%m-%d")),
                                       Annee = as.numeric(strftime(missing_sequence, format = "%Y")),
                                       Mois_annee = factor(strftime(missing_sequence, format = "%B")),
                                       Mois_num = as.numeric(strftime(missing_sequence, format = "%m")),
                                       Jour_num = as.numeric(strftime(missing_sequence, format = "%d")),
                                       Jour_sem_num = as.numeric(strftime(missing_sequence, format = "%w")))

siwim_data_hours_missing[, Jour_semaine := factor(weekdays(Date))]

## Existing and missing data combined
siwim_data_hours_full <- rbindlist(list(siwim_data_hours, siwim_data_hours_missing), fill = T)

Here are the dates where data are missing:

head(siwim_data_hours_full[siwim_data_hours_full$time_step %in% missing_sequence]) %>% kable()
Date Heure Count Total_Weight Total_axle_dist T_mean Vitesse_mean Annee Mois_annee Mois_num Jour_num Jour_sem_num Semaine_num Semaine_fac Heure_num Jour_semaine time_step
2017-07-08 NA NA NA NA NA NA 2017 juillet 7 8 6 NA NA NA samedi 2017-07-08 22:00:00
2017-07-09 NA NA NA NA NA NA 2017 juillet 7 9 0 NA NA NA dimanche 2017-07-09 05:00:00
2017-07-16 NA NA NA NA NA NA 2017 juillet 7 16 0 NA NA NA dimanche 2017-07-16 01:00:00
2017-07-16 NA NA NA NA NA NA 2017 juillet 7 16 0 NA NA NA dimanche 2017-07-16 02:00:00
2017-07-19 NA NA NA NA NA NA 2017 juillet 7 19 3 NA NA NA mercredi 2017-07-19 17:00:00
2017-07-19 NA NA NA NA NA NA 2017 juillet 7 19 3 NA NA NA mercredi 2017-07-19 18:00:00

We then apply linear and spline interpolation to the completed dataset.

# Order by time step before interpolating (rbindlist appended the missing rows at the end)
setorder(siwim_data_hours_full, time_step)

siwim_data_hours_count_full_2 <- na.interpolation(siwim_data_hours_full$Count)
siwim_data_hours_count_full_3 <- na.interpolation(siwim_data_hours_full$Count, option = "spline")

siwim_data_dt_full <- data.table(time_step = siwim_data_hours_full$time_step,
                                 full_interp_lm = siwim_data_hours_count_full_2,
                                 full_interp_spline = siwim_data_hours_count_full_3)

## Plot of the two interpolations
df <- melt(siwim_data_dt_full[, c("time_step", "full_interp_lm", 
                                  "full_interp_spline")], id="time_step")
ggplot(df) + geom_line(aes(x=time_step, y=value, color=variable))  + facet_wrap( ~ variable, scales="free")

Spline interpolation seems to fill in the time series better.
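This visual impression can be checked numerically. The sketch below is an illustration, not part of the original pipeline: it hides a random 5% of the observed hourly counts, re-imputes them with both methods, and compares the errors on the hidden points (`na.interpolation` is the imputeTS function already used above; `siwim_data_hours_full` is the merged dataset built earlier).

```r
library(imputeTS)  # provides na.interpolation()

set.seed(42)
y <- siwim_data_hours_full$Count

# Hide 5% of the points that are actually observed
obs_idx  <- which(!is.na(y))
mask     <- sample(obs_idx, floor(0.05 * length(obs_idx)))
y_masked <- y
y_masked[mask] <- NA

# RMSE of each interpolation method on the hidden points
rmse <- function(est) sqrt(mean((est[mask] - y[mask])^2))
rmse(na.interpolation(y_masked))                     # linear
rmse(na.interpolation(y_masked, option = "spline"))  # spline
```

A lower RMSE for the spline variant would corroborate the visual choice; if the two values are close, the simpler linear interpolation may be preferable.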

Classical time-series models

Before applying a classical time-series model, we must check whether the series is stationary. A stationary time series must satisfy three criteria:

* its mean is constant over time;
* its variance does not depend on time;
* the covariance between terms does not depend on time.
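These three conditions can be written compactly: for every \(t\) and every lag \(h\),

\[
\mathbb{E}[y_t] = \mu, \qquad \operatorname{Var}(y_t) = \sigma^2, \qquad \operatorname{Cov}(y_t,\, y_{t+h}) = \gamma(h),
\]

where \(\mu\), \(\sigma^2\) and \(\gamma(h)\) do not depend on \(t\).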

The Dickey-Fuller test

The Dickey-Fuller test takes as its null hypothesis \(H_0\) that a unit root is present in an autoregressive model of order 1. The alternative hypothesis is that the time series is stationary.

The model is \(y_t = \rho y_{t-1} + u_t\). A unit root is present if \(\rho = 1\).

This equation is rewritten in terms of the difference between \(y_t\) and \(y_{t-1}\): \(\nabla y_t = (\rho - 1) y_{t-1} + u_t = \delta y_{t-1} + u_t\)

The null hypothesis \(H_0\) becomes \(\delta = 0\). Two other versions of the test exist:

* with an initial constant: \(y_t = a_0 + \rho y_{t-1} + u_t\)
* with a time trend: \(y_t = a_0 + a_1 t + \rho y_{t-1} + u_t\)

The augmented Dickey-Fuller test removes the structural effects due to autocorrelations of order greater than 1.
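Concretely, the augmented regression adds \(p\) lagged differences of the series to the test equation (drift and trend terms are optional, as above):

\[
\nabla y_t = a_0 + a_1 t + \delta\, y_{t-1} + \sum_{i=1}^{p} \beta_i\, \nabla y_{t-i} + u_t,
\]

and the null hypothesis is still \(H_0 : \delta = 0\). The `z.diff.lag` terms in the regressions below play the role of the \(\nabla y_{t-i}\).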

# test Dickey Fuller for stationarity detection
X <- siwim_data_hours$Count
lags <- 0
z <- diff(X)
n <- length(z)
z.diff <- embed(z, lags+1)[,1]
z.lag.1 <- X[(lags+1):n]

summary(lm(z.diff~0+z.lag.1 ))
## 
## Call:
## lm(formula = z.diff ~ 0 + z.lag.1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -133.376   -8.007    0.321    9.376  163.289 
## 
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)    
## z.lag.1 -0.032108   0.004815  -6.669 3.11e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.68 on 2726 degrees of freedom
## Multiple R-squared:  0.01605,    Adjusted R-squared:  0.01569 
## F-statistic: 44.47 on 1 and 2726 DF,  p-value: 3.113e-11
#Augmented Dickey Fuller
lags <- 1
z <- diff(X)
n <- length(z)
z.diff <- embed(z, lags+1)[,1]
z.lag.1 <- X[(lags+1):n]
k <- lags + 1
z.diff.lag <- embed(z, lags+1)[, 2:k]

summary(lm(z.diff~0+z.lag.1+z.diff.lag ))
## 
## Call:
## lm(formula = z.diff ~ 0 + z.lag.1 + z.diff.lag)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -136.027   -6.670    1.128    9.827  165.632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## z.lag.1    -0.041184   0.004658  -8.842   <2e-16 ***
## z.diff.lag  0.282663   0.018379  15.380   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.73 on 2724 degrees of freedom
## Multiple R-squared:  0.09467,    Adjusted R-squared:  0.094 
## F-statistic: 142.4 on 2 and 2724 DF,  p-value: < 2.2e-16
#Augmented Dickey Fuller with drift (constant term)
summary(lm(z.diff~1+z.lag.1+z.diff.lag ))
## 
## Call:
## lm(formula = z.diff ~ 1 + z.lag.1 + z.diff.lag)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -135.688  -10.127   -3.242    6.763  160.407 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.807625   0.619152    9.38   <2e-16 ***
## z.lag.1     -0.086047   0.006626  -12.99   <2e-16 ***
## z.diff.lag   0.305085   0.018249   16.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.37 on 2723 degrees of freedom
## Multiple R-squared:  0.123,  Adjusted R-squared:  0.1224 
## F-statistic:   191 on 2 and 2723 DF,  p-value: < 2.2e-16
time <- 1:length(z.diff)

#Augmented Dickey Fuller with drift and a time trend
summary(lm(z.diff~1+time+z.lag.1+z.diff.lag ))
## 
## Call:
## lm(formula = z.diff ~ 1 + time + z.lag.1 + z.diff.lag)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -135.650  -10.169   -3.222    6.654  160.380 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.0430908  0.9722388   6.216  5.9e-10 ***
## time        -0.0001711  0.0005446  -0.314    0.753    
## z.lag.1     -0.0860790  0.0066274 -12.988  < 2e-16 ***
## z.diff.lag   0.3050851  0.0182524  16.715  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.37 on 2722 degrees of freedom
## Multiple R-squared:  0.123,  Adjusted R-squared:  0.1221 
## F-statistic: 127.3 on 3 and 2722 DF,  p-value: < 2.2e-16
# with tseries function

#Simple Dickey-Fuller test
adf.test(siwim_data_hours$Count, k = 0)
## Warning in adf.test(siwim_data_hours$Count, k = 0): p-value smaller than
## printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  siwim_data_hours$Count
## Dickey-Fuller = -9.6405, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
#Augmented Dickey-Fuller test
adf.test(siwim_data_hours$Count, k = 168)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  siwim_data_hours$Count
## Dickey-Fuller = -3.6732, Lag order = 168, p-value = 0.02561
## alternative hypothesis: stationary
#Number of differences required for a stationary series
ndiffs(X)
## [1] 0

Analysis of autocorrelations and partial autocorrelations

The autocorrelation function (ACF) measures the correlation between the time series and a lagged copy of itself (lags t-1, t-2, etc.). For example, at lag 5, the ACF compares the series at time t with the series at time t-5.

For a moving-average series of order n, there is no correlation between \(x(t)\) and \(x(t-n-1)\). The autocorrelation plot therefore drops below the significance threshold at the \(n^{th}\) lag. This is how the appropriate order of an MA (moving average) series is found.

acf(X)


The partial autocorrelation function (PACF) measures the correlation between a time series and a lagged copy of itself, after removing the variation already explained by the intermediate lags. For example, at lag 5, it measures the correlation with time t after removing the effects already explained by lags 1 to 4.

As with the ACF plot, the PACF drops below the significance threshold after a certain lag, which gives the order of an AR (autoregressive) series. For example, for an AR series of order 1, once the effect of lag 1 (\(x(t-1)\)) is excluded, lag 2 (\(x(t-2)\)) is independent of \(x(t)\); the partial autocorrelation therefore falls off sharply after lag 1.

pacf(X)

The ACF plot indicates an MA order of 8 and the PACF plot indicates an AR order of 1.

ARIMA: autoregressive integrated moving average

* Number of AR (autoregressive) terms (p): AR terms are lags of the dependent variable. For instance, if p is 5, the predictors of x(t) are x(t-1), ..., x(t-5).
* Number of MA (moving average) terms (q): MA terms are lagged forecast errors in the prediction equation. For instance, if q is 5, the predictors of x(t) are e(t-1), ..., e(t-5), where e(i) is the difference between the moving-average estimate at the ith instant and the actual value.
* Number of differences (d): the number of non-seasonal differences. Since we took the first-order difference here, we can either pass the differenced variable with d = 0 or the original variable with d = 1; both give the same results.

* q: the lag at which the ACF chart crosses the upper confidence bound for the first time.
* p: the lag at which the PACF chart crosses the upper confidence bound for the first time.

From the ACF and PACF plots above: \(p = 1\) and \(q = 8\).

# AR model 

# forecast window
f_win <- 168

x3 <- xts(siwim_data_hours$Count, order.by = siwim_data_hours$time_step)
attr(x3, 'frequency') <- 24

yt<-x3[1:(length(x3)-f_win)]

model_AR <- arima(yt, order = c(1,0,0))

pd_AR <- funggcast(x3, forecast(model_AR, h = f_win))
p1a<-ggplot(data=pd_AR,aes(x=date,y=observed)) 
p1a<-p1a+geom_line(col='red')
p1a<-p1a+geom_line(aes(y=fitted),col='blue')
p1a<-p1a+geom_line(aes(y=forecast))+geom_ribbon(aes(ymin=lo95,ymax=hi95),alpha=.25)
p1a<-p1a+scale_x_datetime(name='',breaks=date_breaks('8 hours'),minor_breaks= date_breaks('1 hour'),labels=date_format("%Y-%m-%d %H"),expand=c(0,0))
p1a<-p1a+scale_y_continuous(name='Truck count')
p1a <- p1a + labs(title='ARIMA fit to hourly truck counts\n (black=forecast, blue=fitted, red=data, shadow=95% conf. interval)')
p1a<-p1a+theme(axis.text.x=element_text(size=10))

#p1a



dygraph(xts(x=pd_AR[,-c(1,5,6)], order.by = pd_AR[,1]), "Truck counts - ARIMA AR") %>%
  dySeries("observed", label = "Real") %>%
  dySeries("fitted", label = "Fitted") %>%
  dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
  dyRangeSelector()
#plot(forecast(model_AR, 168))


plot(model_AR$residuals)

acf(model_AR$residuals)

pacf(model_AR$residuals)

# MA model
model_MA <- arima(yt, order = c(0,0,8))
accuracy(model_MA)
##                      ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 0.01290228 22.13703 15.36456 -67.1872 82.54818 0.9898522
##                    ACF1
## Training set 0.01013776
pd_MA <- funggcast(x3, forecast(model_MA, h = f_win))

#plot(forecast(model_MA, 168))

dygraph(xts(x=pd_MA[,-c(1,5,6)], order.by = pd_MA[,1]), "Truck counts - ARIMA MA") %>%
  dySeries("observed", label = "Real") %>%
  dySeries("fitted", label = "Fitted") %>%
  dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
  dyRangeSelector()
plot(model_MA$residuals)

acf(model_MA$residuals)

pacf(model_MA$residuals)

# Composite model
model_ARMA <- arima(yt, order = c(1,0,8))
accuracy(model_ARMA)
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.03028622 21.43891 14.38407 -35.04558 59.21652 0.9266842
##                    ACF1
## Training set 0.02237713
#autoplot(forecast(model_ARMA, 168))

pd_ARMA <- funggcast(x3, forecast(model_ARMA, h = f_win))

dygraph(xts(x=pd_ARMA[,-c(1,5,6)], order.by = pd_ARMA[,1]), "Truck counts - ARIMA ARMA") %>%
  dySeries("observed", label = "Real") %>%
  dySeries("fitted", label = "Fitted") %>%
  dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
  dyRangeSelector()
autoplot(model_ARMA$residuals)

autoplot(acf(model_ARMA$residuals))

autoplot(pacf(model_ARMA$residuals))

# Auto ARIMA model
model_auto_ARMA <- auto.arima(yt)

accuracy(model_auto_ARMA)
##                      ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 0.02867587 20.37799 13.58248 -44.71983 64.20429 0.391899
##                      ACF1
## Training set 0.0004519754
#autoplot(forecast(model_auto_ARMA, 168))

pd_auto_ARIMA <- funggcast(x3, forecast(model_auto_ARMA, h = f_win))

dygraph(xts(x=pd_auto_ARIMA[,-c(1,5,6)], order.by = pd_auto_ARIMA[,1]), "Truck counts - auto ARIMA") %>%
  dySeries("observed", label = "Real") %>%
  dySeries("fitted", label = "Fitted") %>%
  dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
  dyRangeSelector()
autoplot(model_auto_ARMA$residuals)

autoplot(acf(model_auto_ARMA$residuals))

autoplot(pacf(model_auto_ARMA$residuals))

## Auto ARIMA with a seasonal period

period <- 24 # first seasonal period (daily cycle)

model_auto_ARMA_seas <- auto.arima(ts(data=yt,frequency = period))

accuracy(model_auto_ARMA_seas)
##                      ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 0.02867587 20.37799 13.58248 -44.71983 64.20429 0.391899
##                      ACF1
## Training set 0.0004519754
#summary(model_auto_ARMA_seas)
#autoplot(forecast(model_auto_ARMA_seas, 168))

pd_auto_ARIMA_seas <- funggcast(x3, forecast(model_auto_ARMA_seas, h = f_win))

dygraph(xts(x=pd_auto_ARIMA_seas[,-c(1,5,6)], order.by = pd_auto_ARIMA_seas[,1]), "Truck counts - seasonal ARIMA") %>%
  dySeries("observed", label = "Real") %>%
  dySeries("fitted", label = "Fitted") %>%
  dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
  dyRangeSelector()
autoplot(model_auto_ARMA_seas$residuals)

autoplot(acf(model_auto_ARMA_seas$residuals))

autoplot(pacf(model_auto_ARMA_seas$residuals))

## Auto ARIMA with double seasonality

# Series shifted by the two seasonal lags
# (note: with hourly xts data, k = 7 shifts by 7 hours, not 7 days)
x6 <- lag(lag(yt,k = 24), k =7)

model_auto_ARMA_double <- auto.arima(x6)

accuracy(model_auto_ARMA_double)
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.03641216 20.43041 13.51384 -41.33014 60.19032 0.3870543
##                       ACF1
## Training set -0.0007796341
#autoplot(forecast(model_auto_ARMA_double, 168))

pd_auto_ARIMA_double_seas <- funggcast(x3, forecast(model_auto_ARMA_double, h = f_win))

dygraph(xts(x=pd_auto_ARIMA_double_seas[,-c(1,5,6)], order.by = pd_auto_ARIMA_double_seas[,1]), "Truck counts - double-seasonality ARIMA") %>%
  dySeries("observed", label = "Real") %>%
  dySeries("fitted", label = "Fitted") %>%
  dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
  dyRangeSelector()
plot(model_auto_ARMA_double$residuals)

#acf(model_auto_ARMA_double$residuals)
#pacf(model_auto_ARMA_double$residuals)
rbind.data.frame("AR model" = accuracy(model_AR),
                 "MA model" = accuracy(model_MA),
                 "ARMA model" = accuracy(model_ARMA),
                 "Auto ARIMA model" = accuracy(model_auto_ARMA),
                 "Seasonal auto ARIMA model" = accuracy(model_auto_ARMA_seas),
                 "Double-seasonality auto ARIMA model" = accuracy(model_auto_ARMA_double)) %>% kable()
ME RMSE MAE MPE MAPE MASE ACF1
AR model 0.0551125 23.44477 15.64161 -42.58471 59.65201 1.0077004 0.2510202
MA model 0.0129023 22.13703 15.36456 -67.18720 82.54818 0.9898522 0.0101378
ARMA model 0.0302862 21.43891 14.38407 -35.04558 59.21652 0.9266842 0.0223771
Auto ARIMA model 0.0286759 20.37799 13.58248 -44.71983 64.20429 0.3918990 0.0004520
Seasonal auto ARIMA model 0.0286759 20.37799 13.58248 -44.71983 64.20429 0.3918990 0.0004520
Double-seasonality auto ARIMA model 0.0364122 20.43041 13.51384 -41.33014 60.19032 0.3870543 -0.0007796

Time-series decomposition (STL)

ts_Y <- as.ts(x3)
dekom <- stl(ts_Y, s.window = "periodic", robust = TRUE)
autoplot(dekom)

Holt-Winters & exponential smoothing

#Simple exponential smoothing
siwim_data_hw_ses <- HoltWinters(yt ,beta = FALSE, 
                                 gamma = FALSE)

#plot(siwim_data_hw_ses)
pd_hw_ses <- funggcast(x3, forecast(siwim_data_hw_ses, h = f_win))


dygraph(xts(x=pd_hw_ses[,-c(1,5,6)], order.by = pd_hw_ses[,1]), "Truck counts - Simple Exponential Smoothing") %>%
  dySeries("observed", label = "Real") %>%
  dySeries("fitted", label = "Fitted") %>%
  dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
  dyRangeSelector()
#Exponential smoothing
siwim_data_hw_es <- HoltWinters(yt, gamma = FALSE)

#plot(siwim_data_hw_es)

pd_hw_es <- funggcast(x3, forecast(siwim_data_hw_es, h = f_win))


dygraph(xts(x=pd_hw_es[,-c(1,5,6)], order.by = pd_hw_es[,1]), "Truck counts - Exponential Smoothing") %>%
  dySeries("observed", label = "Real") %>%
  dySeries("fitted", label = "Fitted") %>%
  dySeries(c("lo95", "forecast", "hi95"), label = "Predicted") %>%
  dyRangeSelector()
rbind.data.frame("Simple Exponential Smoothing" = accuracy(forecast(siwim_data_hw_ses, h = f_win)),
                 "Exponential Smoothing" = accuracy(forecast(siwim_data_hw_es, h = f_win))) %>% kable()
ME RMSE MAE MPE MAPE MASE ACF1
Simple Exponential Smoothing 0.0660460 23.83423 15.52235 -13.43676 40.45712 0.4478706 0.2272056
Exponential Smoothing -0.4874021 23.86568 15.59277 -17.08559 42.09580 0.4499023 0.2274684

Test for autocorrelation of order greater than 1 in the residuals:

#Ljung-Box test on the forecast residuals
Box.test(forecast(siwim_data_hw_es, h = f_win)$residuals, lag = f_win, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  forecast(siwim_data_hw_es, h = f_win)$residuals
## X-squared = 8448, df = 168, p-value < 2.2e-16

Double seasonality

x2 <- msts(siwim_data_hours$Count, seasonal.periods = c(24, 24*7), start = 2017 + 07/12 + 5/365 + 23/8760)

# double_hw <- dshw(x2,24, 24*7)
# frequency(x2)

plot(x2)

# dygraph(x2, "Frequence des camions - Double saisonnalité") %>%
#   dyRangeSelector()

seasonaldecomp <- tbats(x2)
plot(seasonaldecomp)

accuracy(seasonaldecomp)
##                    ME     RMSE     MAE       MPE     MAPE     MASE
## Training set 1.720954 21.69751 13.8904 -20.31345 44.77766 0.886974
##                     ACF1
## Training set -0.01100627
plot(forecast(seasonaldecomp, 168 *4,level = 0.5))

seasonaldecomp2 <- tbats(siwim_data_hours$Count)
plot(seasonaldecomp2)

accuracy(seasonaldecomp2)
##                      ME    RMSE      MAE       MPE     MAPE      MASE
## Training set -0.5908902 22.7422 14.54008 -18.59127 43.58727 0.9284597
##                    ACF1
## Training set 0.06288263
plot(forecast(seasonaldecomp2, 168 *4))

Basic linear models

A first simple model

#First simple linear model

simple_model_freq <- lm(Count ~ 0 + Heure + Jour_semaine, data = siwim_data_hours)

summary(simple_model_freq)
## 
## Call:
## lm(formula = Count ~ 0 + Heure + Jour_semaine, data = siwim_data_hours)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -133.910  -20.320   -4.168   24.816  149.288 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## Heure00                25.770      3.475   7.416 1.61e-13 ***
## Heure01                24.277      3.453   7.031 2.59e-12 ***
## Heure02                27.132      3.453   7.858 5.59e-15 ***
## Heure03                32.208      3.510   9.175  < 2e-16 ***
## Heure04                38.442      3.442  11.170  < 2e-16 ***
## Heure05                53.700      3.548  15.134  < 2e-16 ***
## Heure06                86.848      3.464  25.073  < 2e-16 ***
## Heure07               129.342      3.453  37.462  < 2e-16 ***
## Heure08               128.230      3.442  37.259  < 2e-16 ***
## Heure09               132.535      3.442  38.509  < 2e-16 ***
## Heure10               135.792      3.453  39.331  < 2e-16 ***
## Heure11               127.555      3.431  37.179  < 2e-16 ***
## Heure12               126.727      3.431  36.938  < 2e-16 ***
## Heure13               130.038      3.420  38.021  < 2e-16 ***
## Heure14               142.964      3.431  41.671  < 2e-16 ***
## Heure15               144.025      3.431  41.980  < 2e-16 ***
## Heure16               131.970      3.442  38.345  < 2e-16 ***
## Heure17               113.281      3.442  32.915  < 2e-16 ***
## Heure18                88.907      3.442  25.833  < 2e-16 ***
## Heure19                66.185      3.453  19.169  < 2e-16 ***
## Heure20                48.580      3.453  14.070  < 2e-16 ***
## Heure21                38.194      3.453  11.062  < 2e-16 ***
## Heure22                30.222      3.510   8.609  < 2e-16 ***
## Heure23                26.888      3.499   7.685 2.12e-14 ***
## Jour_semainemardi       3.946      2.356   1.674 0.094170 .  
## Jour_semainemercredi    6.431      2.311   2.782 0.005433 ** 
## Jour_semainejeudi       8.835      2.339   3.777 0.000162 ***
## Jour_semainevendredi   -6.438      2.374  -2.711 0.006747 ** 
## Jour_semainesamedi    -71.354      2.389 -29.867  < 2e-16 ***
## Jour_semainedimanche  -77.516      2.430 -31.905  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.9 on 2698 degrees of freedom
## Multiple R-squared:  0.8793, Adjusted R-squared:  0.8779 
## F-statistic: 655.1 on 30 and 2698 DF,  p-value: < 2.2e-16
# All 24 hourly coefficients (Heure00 to Heure23)
plot(simple_model_freq$coefficients[1:24], type = "h",
     xlab = "Heure", ylab = "coefficient")

# simple_model_freq$residuals

# Regression diagnostics
# plot(siwim_data_hours$Heure, rstudent(simple_model_freq))

# Influence measures for the model, including the leverages hii
mes <- influence.measures(simple_model_freq)

# Plot of the sorted leverages hii
plot(sort(mes$infmat[, "hat"]), type = "h")

# Predictions from the simple model
pred <- predict(simple_model_freq,
                cbind.data.frame(Heure = siwim_data_hours$Heure,
                                 Mois_annee = siwim_data_hours$Mois_annee,
                                 Jour_semaine = siwim_data_hours$Jour_semaine))


# plot(pred, type = "l", col = "red")
# lines(siwim_data_hours$Count, col = "green")

### Comparison graphs of predicted vs real
datas <- rbindlist(list(siwim_data_hours[, .(Count, time_step)],
                        data.table(Count = simple_model_freq$fitted.values, 
                                   time_step = siwim_data_hours[, time_step])))
datas[, type := rep(c("Real", "Fitted"), each = nrow(siwim_data_hours))]

ggplot(data = datas, aes(time_step, Count, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Frequency",
       title = "Fit from simple MLR")

## Visualization of the residuals

ggplot(data = data.table(Fitted_values = simple_model_freq$fitted.values,
                         Residuals = simple_model_freq$residuals),
       aes(Fitted_values, Residuals)) +
  geom_point(size = 1.7) +
  geom_smooth() +
  geom_hline(yintercept = 0, color = "red", size = 1) +
  labs(title = "Fitted values vs Residuals")
## `geom_smooth()` using method = 'gam'

## Function to visualize normality of residuals
ggQQ <- function(lm){
  # extract standardized residuals from the fit
  d <- data.frame(std.resid = rstandard(lm))
  # calculate 1Q/4Q line
  y <- quantile(d$std.resid[!is.na(d$std.resid)], c(0.25, 0.75))
  x <- qnorm(c(0.25, 0.75))
  slope <- diff(y)/diff(x)
  int <- y[1L] - slope * x[1L]
  
  p <- ggplot(data = d, aes(sample = std.resid)) +
    stat_qq(shape = 1, size = 3) +         # open circles
    labs(title = "Normal Q-Q",             # plot title
         x = "Theoretical Quantiles",      # x-axis label
         y = "Standardized Residuals") +   # y-axis label
    geom_abline(slope = slope, intercept = int, linetype = "dashed",
                size = 1, col = "firebrick1") # dashed reference line
  return(p)
}

ggQQ(simple_model_freq)

Model with interactions

################# Linear models with interactions #####################################
model_interact_freq <- lm(Count ~ 0 + Heure + Jour_semaine + Heure:Jour_semaine, 
                          data = siwim_data_hours)

summary(model_interact_freq)
## 
## Call:
## lm(formula = Count ~ 0 + Heure + Jour_semaine + Heure:Jour_semaine, 
##     data = siwim_data_hours)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -152.667   -3.891    0.250    6.250  122.118 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## Heure00                         5.1875     5.1800   1.001 0.316704    
## Heure01                         7.9375     5.1800   1.532 0.125562    
## Heure02                         9.0000     5.1800   1.737 0.082427 .  
## Heure03                        13.1250     5.1800   2.534 0.011343 *  
## Heure04                        17.0625     5.1800   3.294 0.001001 ** 
## Heure05                        41.6250     5.1800   8.036 1.41e-15 ***
## Heure06                        71.0000     5.1800  13.707  < 2e-16 ***
## Heure07                       127.1250     5.1800  24.542  < 2e-16 ***
## Heure08                       140.7500     5.1800  27.172  < 2e-16 ***
## Heure09                       150.1250     5.1800  28.982  < 2e-16 ***
## Heure10                       158.5625     5.1800  30.611  < 2e-16 ***
## Heure11                       150.2500     5.1800  29.006  < 2e-16 ***
## Heure12                       148.6875     5.1800  28.704  < 2e-16 ***
## Heure13                       150.2500     5.1800  29.006  < 2e-16 ***
## Heure14                       169.1250     5.1800  32.650  < 2e-16 ***
## Heure15                       165.3750     5.1800  31.926  < 2e-16 ***
## Heure16                       146.9375     5.1800  28.366  < 2e-16 ***
## Heure17                       113.8125     5.1800  21.972  < 2e-16 ***
## Heure18                        86.2500     5.1800  16.651  < 2e-16 ***
## Heure19                        61.3750     5.1800  11.848  < 2e-16 ***
## Heure20                        37.3125     5.1800   7.203 7.70e-13 ***
## Heure21                        27.8750     5.1800   5.381 8.07e-08 ***
## Heure22                        17.3125     5.1800   3.342 0.000843 ***
## Heure23                        13.7500     5.1800   2.654 0.007993 ** 
## Jour_semainemardi               4.5625     7.3256   0.623 0.533461    
## Jour_semainemercredi            4.7569     7.1192   0.668 0.504076    
## Jour_semainejeudi               6.7537     7.2171   0.936 0.349469    
## Jour_semainevendredi            6.5625     7.3256   0.896 0.370428    
## Jour_semainesamedi              2.5625     7.3256   0.350 0.726517    
## Jour_semainedimanche           -2.8798     7.7367  -0.372 0.709755    
## Heure01:Jour_semainemardi      -3.4412    10.2835  -0.335 0.737931    
## Heure02:Jour_semainemardi      -1.0625    10.3600  -0.103 0.918322    
## Heure03:Jour_semainemardi      -0.8125    10.3600  -0.078 0.937495    
## Heure04:Jour_semainemardi       2.5625    10.3600   0.247 0.804660    
## Heure05:Jour_semainemardi      -1.4375    10.3600  -0.139 0.889655    
## Heure06:Jour_semainemardi      15.6250    10.3600   1.508 0.131625    
## Heure07:Jour_semainemardi      18.8750    10.3600   1.822 0.068584 .  
## Heure08:Jour_semainemardi      -1.4375    10.3600  -0.139 0.889655    
## Heure09:Jour_semainemardi     -10.6875    10.3600  -1.032 0.302350    
## Heure10:Jour_semainemardi      -8.5000    10.3600  -0.820 0.412027    
## Heure11:Jour_semainemardi     -11.8125    10.3600  -1.140 0.254307    
## Heure12:Jour_semainemardi     -11.1875    10.3600  -1.080 0.280299    
## Heure13:Jour_semainemardi      -8.9301    10.2835  -0.868 0.385260    
## Heure14:Jour_semainemardi      -8.7463    10.2835  -0.851 0.395116    
## Heure15:Jour_semainemardi      -2.9375    10.2835  -0.286 0.775168    
## Heure16:Jour_semainemardi       7.9706    10.2835   0.775 0.438362    
## Heure17:Jour_semainemardi      22.1544    10.2835   2.154 0.031305 *  
## Heure18:Jour_semainemardi       9.2463    10.2835   0.899 0.368663    
## Heure19:Jour_semainemardi       1.3566    10.2835   0.132 0.895057    
## Heure20:Jour_semainemardi      -2.6397    10.2835  -0.257 0.797436    
## Heure21:Jour_semainemardi      -9.2022    10.2835  -0.895 0.370951    
## Heure22:Jour_semainemardi      -3.5809    10.2835  -0.348 0.727707    
## Heure23:Jour_semainemardi      -7.0184    10.2835  -0.682 0.494992    
## Heure01:Jour_semainemercredi   -2.3787    10.0056  -0.238 0.812107    
## Heure02:Jour_semainemercredi   -0.8096    10.0056  -0.081 0.935518    
## Heure03:Jour_semainemercredi    2.4122    10.1375   0.238 0.811943    
## Heure04:Jour_semainemercredi    2.6250    10.0681   0.261 0.794326    
## Heure05:Jour_semainemercredi    3.3958    10.0681   0.337 0.735928    
## Heure06:Jour_semainemercredi   17.1319    10.0681   1.702 0.088950 .  
## Heure07:Jour_semainemercredi   16.0069    10.0681   1.590 0.111988    
## Heure08:Jour_semainemercredi   -1.0625    10.0681  -0.106 0.915962    
## Heure09:Jour_semainemercredi   -0.4931    10.0681  -0.049 0.960945    
## Heure10:Jour_semainemercredi  -13.3750    10.0681  -1.328 0.184146    
## Heure11:Jour_semainemercredi  -13.2175    10.0056  -1.321 0.186615    
## Heure12:Jour_semainemercredi   -8.0234    10.0056  -0.802 0.422689    
## Heure13:Jour_semainemercredi   -7.6385    10.0056  -0.763 0.445278    
## Heure14:Jour_semainemercredi   -0.3819    10.0681  -0.038 0.969742    
## Heure15:Jour_semainemercredi    9.5347    10.0681   0.947 0.343716    
## Heure16:Jour_semainemercredi    7.4167    10.0681   0.737 0.461402    
## Heure17:Jour_semainemercredi   28.3129    10.1375   2.793 0.005263 ** 
## Heure18:Jour_semainemercredi   11.5813    10.1375   1.142 0.253387    
## Heure19:Jour_semainemercredi    5.2210    10.1375   0.515 0.606586    
## Heure20:Jour_semainemercredi   -1.8930    10.1375  -0.187 0.851888    
## Heure21:Jour_semainemercredi   -8.0437    10.1375  -0.793 0.427584    
## Heure22:Jour_semainemercredi   -4.8342    10.1375  -0.477 0.633505    
## Heure23:Jour_semainemercredi   -3.9775    10.1375  -0.392 0.694827    
## Heure01:Jour_semainejeudi      -6.4134    10.1375  -0.633 0.527025    
## Heure02:Jour_semainejeudi      -4.0315    10.1375  -0.398 0.690902    
## Heure03:Jour_semainejeudi       0.4743    10.2065   0.046 0.962942    
## Heure04:Jour_semainejeudi       2.7132    10.2065   0.266 0.790388    
## Heure05:Jour_semainejeudi       0.3272    10.2065   0.032 0.974428    
## Heure06:Jour_semainejeudi      20.6581    10.2065   2.024 0.043072 *  
## Heure07:Jour_semainejeudi      19.5102    10.1375   1.925 0.054396 .  
## Heure08:Jour_semainejeudi       3.2741    10.1375   0.323 0.746745    
## Heure09:Jour_semainejeudi      -8.2676    10.1375  -0.816 0.414839    
## Heure10:Jour_semainejeudi      -3.3750    10.2065  -0.331 0.740919    
## Heure11:Jour_semainejeudi     -11.8860    10.2065  -1.165 0.244307    
## Heure12:Jour_semainejeudi     -10.5588    10.2065  -1.035 0.300990    
## Heure13:Jour_semainejeudi      -4.1801    10.2065  -0.410 0.682165    
## Heure14:Jour_semainejeudi       0.1213    10.2065   0.012 0.990517    
## Heure15:Jour_semainejeudi       9.0478    10.2065   0.886 0.375445    
## Heure16:Jour_semainejeudi      19.4265    10.2065   1.903 0.057107 .  
## Heure17:Jour_semainejeudi      17.7279    10.2065   1.737 0.082520 .  
## Heure18:Jour_semainejeudi      12.9375    10.2065   1.268 0.205065    
## Heure19:Jour_semainejeudi       2.4338    10.2835   0.237 0.812930    
## Heure20:Jour_semainejeudi       3.6838    10.2835   0.358 0.720204    
## Heure21:Jour_semainejeudi      -3.8787    10.2835  -0.377 0.706075    
## Heure22:Jour_semainejeudi      -4.3787    10.2835  -0.426 0.670293    
## Heure23:Jour_semainejeudi      -9.0037    10.2835  -0.876 0.381360    
## Heure01:Jour_semainevendredi   -4.0000    10.3600  -0.386 0.699454    
## Heure02:Jour_semainevendredi   -1.6875    10.3600  -0.163 0.870621    
## Heure03:Jour_semainevendredi   -0.1250    10.3600  -0.012 0.990374    
## Heure04:Jour_semainevendredi    4.1250    10.3600   0.398 0.690540    
## Heure05:Jour_semainevendredi    5.5625    10.3600   0.537 0.591368    
## Heure06:Jour_semainevendredi   18.9375    10.3600   1.828 0.067674 .  
## Heure07:Jour_semainevendredi   21.8125    10.3600   2.105 0.035349 *  
## Heure08:Jour_semainevendredi    3.0000    10.3600   0.290 0.772164    
## Heure09:Jour_semainevendredi   -1.3750    10.3600  -0.133 0.894423    
## Heure10:Jour_semainevendredi  -18.6875    10.3600  -1.804 0.071377 .  
## Heure11:Jour_semainevendredi  -19.2500    10.3600  -1.858 0.063268 .  
## Heure12:Jour_semainevendredi  -24.3750    10.3600  -2.353 0.018708 *  
## Heure13:Jour_semainevendredi  -17.0625    10.3600  -1.647 0.099688 .  
## Heure14:Jour_semainevendredi  -35.1250    10.3600  -3.390 0.000708 ***
## Heure15:Jour_semainevendredi  -31.8750    10.3600  -3.077 0.002115 ** 
## Heure16:Jour_semainevendredi  -43.8125    10.3600  -4.229 2.43e-05 ***
## Heure17:Jour_semainevendredi  -40.2500    10.3600  -3.885 0.000105 ***
## Heure18:Jour_semainevendredi  -34.6875    10.3600  -3.348 0.000825 ***
## Heure19:Jour_semainevendredi  -30.5000    10.3600  -2.944 0.003269 ** 
## Heure20:Jour_semainevendredi  -20.1250    10.3600  -1.943 0.052177 .  
## Heure21:Jour_semainevendredi  -18.3125    10.3600  -1.768 0.077244 .  
## Heure22:Jour_semainevendredi  -12.5000    10.3600  -1.207 0.227711    
## Heure23:Jour_semainevendredi  -11.6875    10.3600  -1.128 0.259367    
## Heure01:Jour_semainesamedi     -2.8750    10.3600  -0.278 0.781411    
## Heure02:Jour_semainesamedi     -2.6250    10.3600  -0.253 0.799996    
## Heure03:Jour_semainesamedi     -1.0000    10.3600  -0.097 0.923111    
## Heure04:Jour_semainesamedi     -4.4375    10.3600  -0.428 0.668446    
## Heure05:Jour_semainesamedi    -28.1250    10.3600  -2.715 0.006677 ** 
## Heure06:Jour_semainesamedi    -56.5000    10.3600  -5.454 5.41e-08 ***
## Heure07:Jour_semainesamedi   -107.8125    10.3600 -10.407  < 2e-16 ***
## Heure08:Jour_semainesamedi   -123.1250    10.3600 -11.885  < 2e-16 ***
## Heure09:Jour_semainesamedi   -129.3125    10.3600 -12.482  < 2e-16 ***
## Heure10:Jour_semainesamedi   -137.2500    10.3600 -13.248  < 2e-16 ***
## Heure11:Jour_semainesamedi   -131.9375    10.3600 -12.735  < 2e-16 ***
## Heure12:Jour_semainesamedi   -129.6875    10.3600 -12.518  < 2e-16 ***
## Heure13:Jour_semainesamedi   -134.1250    10.3600 -12.946  < 2e-16 ***
## Heure14:Jour_semainesamedi   -155.8125    10.3600 -15.040  < 2e-16 ***
## Heure15:Jour_semainesamedi   -151.1875    10.3600 -14.593  < 2e-16 ***
## Heure16:Jour_semainesamedi   -134.0333    10.4460 -12.831  < 2e-16 ***
## Heure17:Jour_semainesamedi   -102.5625    10.3600  -9.900  < 2e-16 ***
## Heure18:Jour_semainesamedi    -74.1250    10.3600  -7.155 1.09e-12 ***
## Heure19:Jour_semainesamedi    -56.5625    10.3600  -5.460 5.23e-08 ***
## Heure20:Jour_semainesamedi    -33.7500    10.3600  -3.258 0.001138 ** 
## Heure21:Jour_semainesamedi    -26.5000    10.3600  -2.558 0.010587 *  
## Heure22:Jour_semainesamedi    -16.8750    10.7830  -1.565 0.117715    
## Heure23:Jour_semainesamedi    -13.8125    10.7830  -1.281 0.200327    
## Heure01:Jour_semainedimanche   -2.9744    11.0664  -0.269 0.788125    
## Heure02:Jour_semainedimanche   -4.6587    10.9414  -0.426 0.670301    
## Heure03:Jour_semainedimanche   -8.6997    11.2124  -0.776 0.437877    
## Heure04:Jour_semainedimanche  -12.4952    10.6546  -1.173 0.241006    
## Heure05:Jour_semainedimanche  -37.1738    12.1663  -3.055 0.002270 ** 
## Heure06:Jour_semainedimanche  -64.1202    10.8330  -5.919 3.67e-09 ***
## Heure07:Jour_semainedimanche -120.7452    10.8330 -11.146  < 2e-16 ***
## Heure08:Jour_semainedimanche -132.4702    10.7382 -12.336  < 2e-16 ***
## Heure09:Jour_semainedimanche -137.0452    10.7382 -12.762  < 2e-16 ***
## Heure10:Jour_semainedimanche -141.4160    10.7382 -13.169  < 2e-16 ***
## Heure11:Jour_semainedimanche -131.1827    10.6546 -12.312  < 2e-16 ***
## Heure12:Jour_semainedimanche -131.1827    10.6546 -12.312  < 2e-16 ***
## Heure13:Jour_semainedimanche -131.6202    10.6546 -12.353  < 2e-16 ***
## Heure14:Jour_semainedimanche -147.2452    10.6546 -13.820  < 2e-16 ***
## Heure15:Jour_semainedimanche -147.0577    10.6546 -13.802  < 2e-16 ***
## Heure16:Jour_semainedimanche -129.0577    10.6546 -12.113  < 2e-16 ***
## Heure17:Jour_semainedimanche  -91.6827    10.6546  -8.605  < 2e-16 ***
## Heure18:Jour_semainedimanche  -66.1827    10.6546  -6.212 6.10e-10 ***
## Heure19:Jour_semainedimanche  -46.4327    10.6546  -4.358 1.36e-05 ***
## Heure20:Jour_semainedimanche  -23.0577    10.6546  -2.164 0.030549 *  
## Heure21:Jour_semainedimanche  -17.8077    10.6546  -1.671 0.094773 .  
## Heure22:Jour_semainedimanche   -9.3660    10.7382  -0.872 0.383175    
## Heure23:Jour_semainedimanche   -6.8077    10.6546  -0.639 0.522918    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.72 on 2560 degrees of freedom
## Multiple R-squared:  0.9546, Adjusted R-squared:  0.9516 
## F-statistic: 320.2 on 168 and 2560 DF,  p-value: < 2.2e-16
c(Previous = summary(simple_model_freq)$r.squared, New = summary(model_interact_freq)$r.squared)
##  Previous       New 
## 0.8792895 0.9545746
## Residuals comparison
ggplot(data.table(Residuals = c(simple_model_freq$residuals, model_interact_freq$residuals),
                  Type = c(rep("MLR - simple", nrow(siwim_data_hours)),
                           rep("MLR with interactions", nrow(siwim_data_hours)))), 
       aes(Type, Residuals, fill = Type)) +
  geom_boxplot()

ggplotly()
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## Graph comparison of predicted vs real values
datas <- rbindlist(list(siwim_data_hours[, .(Count, time_step)],
                        data.table(Count = model_interact_freq$fitted.values, 
                                   time_step = siwim_data_hours[, time_step])))
datas[, type := rep(c("Real", "Fitted"), each = nrow(siwim_data_hours))]

ggplot(data = datas, aes(time_step, Count, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Frequency",
       title = "Fit from MLR with interactions")

## Visualization of the residuals
ggplot(data = data.table(Fitted_values = model_interact_freq$fitted.values,
                         Residuals = model_interact_freq$residuals),
       aes(Fitted_values, Residuals)) +
  geom_point(size = 1.7) +
  # geom_smooth() +
  geom_hline(yintercept = 0, color = "red", size = 1) +
  labs(title = "Fitted values vs Residuals")

ggQQ(model_interact_freq)

Stepwise selection of lag variables

## lags

nb_lags <- 168

lags <- paste("lag", 1:nb_lags, sep = "_")

## Build matrices with dummy variables

siwim_data_hours_X <- siwim_data_hours[, c("Count", "Heure", "Jour_semaine")]

siwim_data_hours_X[, (lags) := shift(siwim_data_hours$Count, 1:nb_lags)]

# Full model on all lags (rows with incomplete lags dropped), then
# stepwise selection in both directions with the BIC penalty k = log(n)
lmfit <- lm(Count ~ ., data = na.omit(siwim_data_hours_X))

model_step <- step(lmfit, direction = "both", k = log(nrow(siwim_data_hours_X)))

Selection results:

##          (Intercept)              Heure01              Heure02 
##           9.59421759          -1.09020865           0.54205982 
##              Heure03              Heure04              Heure05 
##           2.64088171           1.63177828          11.86325680 
##              Heure06              Heure07              Heure08 
##          26.84855041          43.32919136           9.67081567 
##              Heure09              Heure10              Heure11 
##          11.92554829          17.92234708          10.30276396 
##              Heure12              Heure13              Heure14 
##          12.39874845          19.02926018          28.53946273 
##              Heure15              Heure16              Heure17 
##          21.89094857          11.08178061           2.39168501 
##              Heure18              Heure19              Heure20 
##          -3.62807002          -6.43401699          -7.42600959 
##              Heure21              Heure22              Heure23 
##          -5.58118969          -4.73008825          -2.14588119 
## Jour_semainedimanche    Jour_semainejeudi    Jour_semainemardi 
##         -12.31223273           2.75519186          -2.16598716 
## Jour_semainemercredi   Jour_semainesamedi Jour_semainevendredi 
##           3.02810112         -12.31277735           0.92405967 
##                lag_1                lag_2                lag_4 
##           0.75114780           0.14287950          -0.05637018 
##                lag_9               lag_15               lag_23 
##          -0.10768881           0.05869562           0.05299710 
##               lag_27               lag_29               lag_45 
##          -0.08286029           0.05671318          -0.06393391 
##               lag_95              lag_100              lag_111 
##          -0.04298543           0.03068071          -0.02257531 
##              lag_143              lag_164              lag_168 
##           0.02357880           0.03073459          -0.02753100 
##               lag_90 
##           0.02578676
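The selected model can be assessed with the same criteria as the earlier fits. A quick in-sample check (a sketch, assuming `model_step` as produced by the `step()` call above):

```r
# In-sample fit of the stepwise-selected model
sqrt(mean(residuals(model_step)^2))   # residual RMSE
summary(model_step)$r.squared         # R-squared, comparable to the earlier MLRs
```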

Cross-validation for time series

Time series are chronologically ordered observations. Unlike non-temporal data, their ordering therefore matters and must be respected during cross-validation. Two approaches exist for this:
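Both approaches rest on the same principle: train only on past observations and evaluate on later ones. A minimal rolling-origin sketch of that principle, reusing the simple model and the `siwim_data_hours` table from above (the window sizes are illustrative assumptions):

```r
n <- nrow(siwim_data_hours)
origin  <- floor(0.8 * n)   # initial training window
horizon <- 24               # evaluate 24 hours ahead at each step
errors  <- c()

for (start in seq(origin, n - horizon, by = horizon)) {
  train <- siwim_data_hours[1:start]                        # past only
  test  <- siwim_data_hours[(start + 1):(start + horizon)]  # future block
  fit   <- lm(Count ~ 0 + Heure + Jour_semaine, data = train)
  pred  <- predict(fit, newdata = test)
  errors <- c(errors, sqrt(mean((test$Count - pred)^2)))    # RMSE per fold
}

mean(errors)  # average out-of-sample RMSE
```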

Cross-validation on the time series